Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: Check if item version is already preserved before bagging (Issue #102) #103

Merged

Conversation

HafeezOJ
Copy link
Collaborator

@HafeezOJ HafeezOJ commented Jun 17, 2024

Description

During preprocessing, this PR checks if a bag exists in AP Trust and Wasabi S3 bucket. It compares the hash of the current item version being prepared for bagging with the item's version hash in AP Trust if the item version has already been preserved. The article version will be skipped if a match is found else its bag will be updated. All activities are logged.

NOTE: This feature may sometimes put a name other than the first author's name in the eventual preservation package file due to the metadata sorting during metadata hash computation.

PROPOSED SOLUTION: Ignore authors' list during sorting while computing metadata hash. This is not included in this PR.

See #93

Documentation Update

  • I have updated README.md and other relevant documentation
  • No documentation update is needed

Implementation Notes

This PR contains Utils.py in the figshare directory which houses utility functions. The following functions are available in this PR:

  • standardize_api_result: This standardizes results from APIs. It replaces 'null' with an empty string to have a constant value for empty fields in API results.
  • sorter_api_result: This recursively sorts results from APIs. The sorting ensures a consistent arrangement of fields in API results to facilitate the generation of a constant hash for the result at any time.
  • get_preserved_version_hash_and_size: This connects to AP Trust to extract hash and size of an already preserved article version if one exists.
  • compare_hash: This compares the hash value of the current article version in pre-processing with the hash of the article version in AP Trust.
  • check_wasabi: This checks if an article version has already been preserved in Wasabi S3 bucket.

Bag checks are carried out in Article.py and Collection.py inside the figshare directory. Logging is done in app.py

@HafeezOJ HafeezOJ linked an issue Jun 17, 2024 that may be closed by this pull request
1 task
@HafeezOJ HafeezOJ requested a review from zoidy June 17, 2024 22:40
Copy link
Collaborator

@zoidy zoidy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before I can continue testing, please address

  • The issues in my test results sheet.
  • The comments in the code

README.md Outdated Show resolved Hide resolved
figshare/Article.py Outdated Show resolved Hide resolved
figshare/Utils.py Outdated Show resolved Hide resolved
figshare/Article.py Outdated Show resolved Hide resolved
figshare/Article.py Outdated Show resolved Hide resolved
app.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@zoidy zoidy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The summary log behavior for collections is different than for articles. E.g. when a collection is already in Wasabi, a warning is output for collections but not for articles. Also the counters and text are somewhat different. For consistency, they should be the same ( minus the parts about matching obviously)

figshare/Article.py Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
app.py Outdated Show resolved Hide resolved
app.py Show resolved Hide resolved
figshare/Article.py Outdated Show resolved Hide resolved
app.py Outdated Show resolved Hide resolved
figshare/Utils.py Outdated Show resolved Hide resolved
Copy link
Collaborator Author

@HafeezOJ HafeezOJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your comments have been implemented. You can continue to review.

Copy link
Collaborator

@zoidy zoidy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The code seems to fail on embargoed content. Please see the spreadsheet.
  2. I also further cleaned up and prettified the log outputs. The counts haven't changed, I just made them display more consistently (I think). Have a look to see if it makes sense

@HafeezOJ
Copy link
Collaborator Author

HafeezOJ commented Sep 12, 2024

  1. I tried to recreate the error but embargoed items were preserved and skipped when they were already preserved. It seems you forgot to set up a curation staging directory for the embargoed item which might have caused the failure. Try to set up a curation staging directory and try again.
  2. The log output makes sense. I understand what the messages are saying.

@zoidy
Copy link
Collaborator

zoidy commented Sep 12, 2024

  1. I tried to recreate the error but embargoed items were preserved and skipped when they were already preserved. It seems you forgot to set up a curation staging directory for the embargoed item which might have caused the failure. Try to set up a curation staging directory and try again.
  • It can't be an issue with curation staging directory because the error occurs during the fetching stage, not the matching stage. In any case, I checked the curation folder just to be sure and indeed, the embargoed item is there. See my slack message.
  1. The log output makes sense. I understand what the messages are saying.
  • Excellent!

  • 3. Additional issue: Size calculation is incorrect when items are skipped

Copy link
Collaborator Author

@HafeezOJ HafeezOJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sizes of skipped items have been excluded from size calculation for space check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature: Check if bag exists in AP Trust prior to bagging during preprocessing
2 participants